ABSTRACT

We consider the classical tree edit distance between ordered 
labeled trees, which is defined as the minimum-cost sequence 
of node edit operations that transform one tree into an- 
other. The state-of-the-art solutions for the tree edit dis- 
tance are not satisfactory. The main competitors in the field 
either have optimal worst-case complexity, but the worst 
case happens frequently, or they are very efficient for some 
tree shapes, but degenerate for others. This leads to unpre- 
dictable and often infeasible runtimes. There is no obvious 
way to choose between the algorithms. 

In this paper we present RTED, a robust tree edit distance 
algorithm. The asymptotic complexity of RTED is smaller than
or equal to the complexity of the best competitors for any
input instance, i.e., RTED is both efficient and worst-case
optimal. We introduce the class of LRH (Left-Right-Heavy)
algorithms, which includes RTED and the fastest tree edit 
distance algorithms presented in literature. We prove that 
RTED outperforms all previously proposed LRH algorithms 
in terms of runtime complexity. In our experiments on
synthetic and real-world data, we empirically evaluate our
solution and compare it to the state of the art.

1. INTRODUCTION 

Tree structured data appears in many applications, for 
example, XML documents can be represented as ordered la- 
beled trees. An interesting query computes the difference 
between two trees. This is useful when dealing with differ- 
ent versions of a tree (for example, when synchronizing file 
directories or archiving Web sites), or to find pairs of similar
trees (for example, for record linkage or data mining). The 
standard approach to tree differences is the tree edit dis- 
tance, which computes the minimum-cost sequence of node 
edit operations that transform one tree into another. The 
tree edit distance has been applied successfully in a wide 
range of applications, such as bioinformatics [1, 20, 26], im- 
age analysis [6], pattern recognition [23], melody recogni- 
tion [19], natural language processing [25], or information 
extraction [21, 14], and has received considerable attention
from the database community [2, 8, 12, 13, 17, 18, 24]. 

The tree edit distance problem has a recursive solution
that decomposes the trees into smaller subtrees and subforests.
The best known algorithms are dynamic programming
implementations of this recursive solution. Only two of these
algorithms achieve $O(n^2)$ space complexity, where $n$ is the
number of tree nodes: the classical algorithm by Zhang and
Shasha [31], which runs in $O(n^4)$ time, and the algorithm by
Demaine et al. [15], which runs in $O(n^3)$ time. Demaine's
algorithm was shown to be worst-case optimal, i.e., no im- 
plementation of the recursive solution can improve over the 
cubic runtime for the worst case. The runtime complexity 
is given by the number of subproblems that must be solved. 

The runtime behavior of the two space-efficient algorithms 
heavily depends on the tree shapes and it is hard to choose 
between the algorithms. Zhang's algorithm runs efficiently
in $O(n^2 \log^2 n)$ time for trees with depth $O(\log n)$, but it
runs into the $O(n^4)$ worst case for some other shapes.
Demaine's algorithm is better in the worst case, but unfortunately
the $O(n^3)$ worst case happens frequently, also for
tree shapes for which Zhang's algorithm is almost quadratic.
The runtime complexity can vary by more than a polyno- 
mial degree, and the choice of the wrong algorithm leads 
to a prohibitive runtime for tree pairs that could otherwise 
be computed efficiently. There is no easy way to predict 
the runtime behavior of the algorithms for a specific pair of 
trees, and this problem has not been addressed in literature. 

In this paper we develop a new algorithm for the tree edit 
distance called RTED. Our algorithm is robust: independent
of the tree shape, the number of subproblems that
RTED computes is at most as high as the number of
subproblems the best competitor must compute. In many cases
RTED beats the competitors and is still efficient when they
run into the worst case. RTED requires $O(n^2)$ space, like the
most space-efficient competitors, and its runtime complexity
of $O(n^3)$ in the worst case is optimal.

The key to our solution is our dynamic decomposition 
strategy. A decomposition strategy recursively decomposes 
the input trees into subforests by removing nodes. At each 
recursive step, either the leftmost or the rightmost root node 
must be removed. Different choices lead to different numbers 
of subproblems and thus to different runtime complexities. 
Zhang's algorithm always removes from the right, whereas
Demaine's algorithm removes the largest subtree in a subforest
last, after removing all nodes to the left and then all nodes
to the right of that subtree. Our algorithm dynamically chooses
one of the above strategies, and we show that the choice is optimal.
We develop a recursive cost formula for the optimal strategy
and we present an algorithm that computes the strategy
in $O(n^2)$ time and space. The computation of the strategy
does not increase space or runtime complexity of the tree 
edit distance algorithm. Our experimental evaluation con- 
firms the analytic results and shows that the time used to 
compute the strategy is small compared to the overall runtime,
and the percentage decreases with the tree size. In summary,
the contributions of this paper are the following:

• We introduce the class of LRH algorithms and the general
tree edit distance algorithm, GTED, which implements
any LRH strategy in $O(n^2)$ space.

• We present an efficient algorithm that computes the 
optimal LRH strategy for GTED in $O(n^2)$ time and
space. The strategy computation does not increase the 
overall space or runtime complexity and takes only a 
small percentage of the overall runtime. 

• We present RTED, our robust tree edit distance algo- 
rithm. For any tree pair, the number of subproblems 
computed by RTED is at most the number of subprob- 
lems computed by any known LRH algorithm. 

• We empirically evaluate RTED and compare it to the 
state-of-the-art algorithms. To the best of our knowl- 
edge, this is the first experimental evaluation of the 
state-of-the-art in computing the tree edit distance. 

The rest of the article is organized as follows. Section 2 
provides background material, Section 3 defines the prob- 
lem, and Section 4 introduces GTED. In Section 5 we ana- 
lyze decomposition strategies for GTED and introduce the 
robust tree edit distance algorithm RTED in Section 6. Sec- 
tion 7 discusses related work. We experimentally evaluate 
our solution in Section 8 and conclude in Section 9. 

2. NOTATION AND BACKGROUND 

We introduce our notation and recap basic concepts of the
tree edit distance computation.

2.1 Notation 

A tree $T$ is a directed, acyclic, connected graph with nodes
$N(T)$ and edges $E(T) \subseteq N(T) \times N(T)$, where each node has
at most one incoming edge. A forest $F$ is a graph in which
each connected component is a tree; each tree is also a forest.
Each node has a label, which is not necessarily unique. The
nodes of a forest $F$ are strictly and totally ordered such that
$v > w$ for any edge $(v,w) \in E(F)$. The tree traversal that
visits all nodes in ascending order is the postorder traversal.

In an edge $(v, w)$, node $v$ is the parent and $w$ is the child,
$p(w) = v$. A node with no parent is a root node, a node
without children is a leaf. A node $x$ is an ancestor of node $v$
iff $x = p(v)$ or $x$ is an ancestor of $p(v)$; $x$ is a descendant of
$v$ iff $v$ is an ancestor of $x$. A node $v$ is to the left (right) of $w$
iff $v < w$ ($v > w$) and $v$ is not a descendant (ancestor) of $w$.
$r_L(F)$ and $r_R(F)$ are respectively the leftmost and rightmost
root nodes in $F$; if $F$ is a tree, then $r(F) = r_L(F) = r_R(F)$.

A subforest of a tree $T$ is a graph with nodes $N' \subseteq N(T)$
and edges $E' = \{(v,w) \mid (v,w) \in E(T), v \in N', w \in N'\}$.
$T_v$ is the subtree rooted in node $v$ of $T$ iff $T_v$ is a subforest of
$T$ and $N(T_v) = \{x \mid x = v \text{ or } x \text{ is a descendant of } v \text{ in } T\}$.
A path $\gamma$ in $F$ is a connected subforest of $F$ in which each
node has at most one child.



We use the following short notation: By $|F| = |N(F)|$ we
denote the size of $F$, we write $v \in F$ for $v \in N(F)$, and
denote the empty forest with $\emptyset$. $F - v$ is the forest obtained
from $F$ by removing node $v$ and all edges at $v$. By $F - F_v$
we denote the forest obtained from $F$ by removing subtree
$F_v$. $F - \gamma$ is the set of subtrees of $F$ obtained by removing
path $\gamma$ from $F$: $F - \gamma = \{F_v : v \notin \gamma \wedge \exists x (x \in \gamma \wedge x = p(v))\}$.

Example 1. The nodes of tree $T$ in Figure 1 are
$N(T) = \{v_1, v_2, v_3, v_4, v_5\}$, the edges are
$E(T) = \{(v_1,v_2), (v_1,v_5), (v_1,v_4), (v_5,v_3)\}$, and the node labels
are shown in italics in the figure. The root of $T$ is $r(T) = v_1$, and
$|T| = 5$. $T_{v_5}$ with nodes $N(T_{v_5}) = \{v_5, v_3\}$ and edges
$E(T_{v_5}) = \{(v_5,v_3)\}$ is a subtree of $T$. $T - v_1$ is the subforest
of $T$ with $N(T - v_1) = \{v_2, v_3, v_4, v_5\}$ and
$E(T - v_1) = \{(v_5,v_3)\}$, $r_L(T - v_1) = v_2$, $r_R(T - v_1) = v_4$.



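[Figure 1 shows tree $T$ (root $v_1$ labeled a with children $v_2$ (b), $v_5$ (c), and $v_4$ (d); $v_3$ (e) is the child of $v_5$), the tree obtained from $T$ by deleting node $v_5$ (the inverse operation inserts $v_5$), and the tree obtained from $T$ by renaming node $v_5$ to x.]

Figure 1: Example trees and edit operations.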

2.2 Recursive Solution for Tree Edit Distance 

The tree edit distance, $\delta(F, G)$, is defined as the cost of the
minimum-cost sequence of node edit operations that transforms $F$ into
$G$. We use the standard edit operations [15, 31]: delete a
node and connect its children to its parent maintaining the
order; insert a new node between an existing node, $v$, and
a consecutive subsequence of $v$'s children; rename the label
of a node (see Figure 1). The costs are $c_d(v)$ for deleting $v$,
$c_i(v)$ for inserting $v$, and $c_r(v,w)$ for renaming $v$ to $w$.

The tree edit distance has the recursive solution shown
in Figure 2 [31]. The distance between two forests $F$ and
$G$ is computed in constant time from the solutions of three
or four (depending on whether the two forests are trees)
of the following smaller subproblems: (1) $\delta(F - v, G)$, (2)
$\delta(F, G - w)$, (3) $\delta(F_v, G_w)$, (4) $\delta(F - F_v, G - G_w)$, and (5)
$\delta(F - v, G - w)$. The nodes $v$ and $w$ are either both the
leftmost ($v = r_L(F)$, $w = r_L(G)$) or both the rightmost
($v = r_R(F)$, $w = r_R(G)$) root nodes of the respective forests.
The subproblems that result from recursively decomposing
$F$ and $G$ are called the relevant subproblems. The number
of relevant subproblems depends on the choice of the nodes
$v$ and $w$ at each recursive step.
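
As a rough illustration of this recursion, the following Python sketch evaluates it top-down with memoization. It is not one of the quadratic-space algorithms discussed in this paper: it always picks the rightmost root nodes, caches every subforest pair, and assumes unit costs ($c_d = c_i = 1$, and $c_r = 0$ for equal labels, 1 otherwise). The tuple encoding of trees and the function name `ted` are assumptions made for the example.

```python
from functools import lru_cache

# Assumed encoding: a tree is (label, children) with children a tuple of
# trees; a forest is a tuple of trees.

def ted(F, G):
    """Edit distance between forests F and G under unit costs (assumption)."""

    @lru_cache(maxsize=None)
    def d(F, G):
        if not F and not G:
            return 0
        if not G:
            _, kids = F[-1]
            return d(F[:-1] + kids, G) + 1          # delete rightmost root v
        if not F:
            _, kids = G[-1]
            return d(F, G[:-1] + kids) + 1          # insert rightmost root w
        (lv, kv), (lw, kw) = F[-1], G[-1]
        delete = d(F[:-1] + kv, G) + 1              # subproblem (1)
        insert = d(F, G[:-1] + kw) + 1              # subproblem (2)
        if len(F) == 1 and len(G) == 1:             # both forests are trees
            rename = d(kv, kw) + (lv != lw)         # subproblem (5)
            return min(delete, insert, rename)
        # subproblems (3) and (4): match the rightmost subtrees
        match = d((F[-1],), (G[-1],)) + d(F[:-1], G[:-1])
        return min(delete, insert, match)

    return d(F, G)

# Tree T of Figure 1 and the tree obtained by deleting node v5.
T = ("a", (("b", ()), ("c", (("e", ()),)), ("d", ())))
T_del = ("a", (("b", ()), ("e", ()), ("d", ())))
print(ted((T,), (T_del,)))  # 1
```

Because every subforest pair is cached, the sketch trades space for simplicity; the algorithms discussed in the following sections restrict the choice of $v$ and $w$ precisely to avoid this.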

2.3 Dynamic Programming Algorithms 

The fastest algorithms for the tree edit distance are dy- 
namic programming implementations of the recursive solu- 
tion. Since each subproblem is computed in constant time 
from other subproblems, the runtime complexity of these 
algorithms is equal to the number of different relevant subproblems
they produce. Decomposing a tree with the recursive
formula in Figure 2 can result in a quadratic number
of subforests. Thus the space complexity of a straightforward
algorithm, which stores the distance between all pairs
of subforests of two trees $F$ and $G$, is $O(|F|^2 |G|^2)$.

Fortunately, the required storage space can be reduced
to $O(|F||G|)$ by computing the subproblems bottom-up and
reusing space. To achieve this goal, the quadratic space
solutions restrict the choice of $v$ and $w$ in each recursive
step. Two solutions have been presented in literature. The
algorithm by Zhang and Shasha [31] always chooses the same
direction ($v$ and $w$ are either both leftmost or both rightmost
root nodes). Demaine et al. [15] switch between left- and
rightmost root nodes depending on a predefined path.

$$\delta(\emptyset, \emptyset) = 0,$$

$$\delta(F, \emptyset) = \delta(F - v, \emptyset) + c_d(v),$$

$$\delta(\emptyset, G) = \delta(\emptyset, G - w) + c_i(w),$$

if $F$ is not a tree or $G$ is not a tree:

$$\delta(F, G) = \min \begin{cases} \delta(F - v, G) + c_d(v) & (1) \\ \delta(F, G - w) + c_i(w) & (2) \\ \delta(F_v, G_w) + \delta(F - F_v, G - G_w) & (3), (4) \end{cases}$$

if $F$ is a tree and $G$ is a tree:

$$\delta(F, G) = \min \begin{cases} \delta(F - v, G) + c_d(v) & (1) \\ \delta(F, G - w) + c_i(w) & (2) \\ \delta(F - v, G - w) + c_r(v, w) & (5) \end{cases}$$

Figure 2: Recursive formula for Tree Edit Distance.

While the recursive formula is symmetric, i.e., $\delta(F, G)$ and
$\delta(G, F)$ produce the same set of subproblems, this does not
necessarily hold for its dynamic programming implementa- 
tions. The bottom-up strategies used in these algorithms 
compute and store subproblems for later use. When the 
direction is allowed to change, the subproblems that are re- 
quired later are hard to predict and only a subset of the 
precomputed subproblems is actually used later. Thus, in 
addition to the direction, a decomposition strategy must also 
choose the order of the parameters. 

3. PROBLEM DEFINITION 

As outlined in the previous section, a dynamic program- 
ming algorithm that implements the recursive solution of 
the tree edit distance must choose a direction (left or right) 
and the order of the input forests at each recursive step. 
The choices at each step form a strategy, which determines 
the overall number of subproblems that must be computed. 

In this paper we introduce the class of path strategies (cf. 
Section 4). Path strategies can be expressed using a set 
of non-overlapping paths that connect tree nodes to leaves. 
The choice at each recursive step depends on the position 
of the paths. The class of path strategies is of particular 
interest since only for this class quadratic space solutions are 
known. An LRH strategy is a path strategy which uses only 
left, right, and heavy paths. LRH strategies are sufficient to 
express strategies with optimal asymptotic complexity. An 
algorithm based on an LRH strategy is an LRH algorithm. 
The most efficient tree edit distance algorithms presented in 
literature [15, 22, 31] fall into the class of LRH algorithms. 

The state-of-the-art in path algorithms is not satisfac- 
tory. All previously proposed path algorithms degenerate, 
i.e., they run into their worst case although a better path 
strategy exists. The difference can be a polynomial degree, 
leading to highly varying and often infeasible runtimes. 

Figure 3: Subforests of full decomposition.

The goal of this paper is to develop a new LRH algorithm
which combines the features of previously proposed algorithms
(worst case guarantees for time and space) and in
addition is robust, i.e., it does not degenerate. More specifically,
the algorithm should have the following properties:

• space-efficient: the space complexity should be $O(n^2)$,
which is the best known complexity for a tree edit 
distance algorithm; 

• optimal runtime: the runtime complexity should be
$O(n^3)$, which has been shown to be optimal among all possible
strategies for the recursive formula in Figure 2 [15];

• robust: the algorithm should not run into the worst 
case if a better LRH strategy exists. 

Our solution is the RTED algorithm, which satisfies all 
above requirements. In addition we show that for any instance
the number of subproblems computed by RTED is
smaller than or equal to the number of subproblems computed
by all previously proposed LRH algorithms.

4. A GENERAL ALGORITHM FOR PATH STRATEGIES

In this section we introduce the GTED algorithm, which 
generalizes most of the existing tree edit distance algorithms. 

4.1 Relevant Subforests and Subtrees 

The subforests that result from decomposing a tree with 
the recursive formula in Figure 2 are called the relevant sub- 
forests. The set of all subforests that can result from any de- 
composition is called the full decomposition. It results from 
repeated removal of the rightmost or leftmost root node. An 
example is shown in Figure 3. 

Definition 1. The full decomposition of a tree $F$, $\mathcal{A}(F)$,
is the set of all subforests of $F$ obtained by recursively removing
the leftmost and rightmost root nodes, $r_L(F)$ and
$r_R(F)$, from $F$ and the resulting subforests:

$$\mathcal{A}(\emptyset) = \emptyset$$

$$\mathcal{A}(F) = \{F\} \cup \mathcal{A}(F - r_L(F)) \cup \mathcal{A}(F - r_R(F))$$
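
Definition 1 can be read as a simple closure computation. The Python sketch below enumerates the full decomposition of a tree encoded as nested tuples `(node_id, children)`; the encoding and the function name are assumptions for illustration, and node ids are assumed to be unique so that equal tuples denote the same subforest.

```python
def full_decomposition(tree):
    """Enumerate A(F): every non-empty subforest reachable from the tree by
    repeatedly removing the leftmost or the rightmost root node."""
    seen = set()
    stack = [(tree,)]                  # a forest is a tuple of trees
    while stack:
        forest = stack.pop()
        if not forest or forest in seen:
            continue                   # A(empty) is empty; skip duplicates
        seen.add(forest)
        left_root, right_root = forest[0], forest[-1]
        stack.append(left_root[1] + forest[1:])      # F - r_L(F)
        stack.append(forest[:-1] + right_root[1])    # F - r_R(F)
    return seen

# Tree T of Figure 1 with node ids instead of labels: v1(v2, v5(v3), v4).
T = (1, ((2, ()), (5, ((3, ()),)), (4, ())))
print(len(full_decomposition(T)))  # 10
```

Removing a root node promotes its children to roots in the same position, which is why the children tuple is spliced in at the left or right end of the forest.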

Next we show how to decompose a tree into subtrees and
subforests based on a so-called root-leaf path, a path that
connects the root node of the tree to one of its leaves. The
set of all root-leaf paths of $F$ is denoted as $\gamma^*(F)$. The left
path, $\gamma^L(F)$, the right path, $\gamma^R(F)$, and the heavy path,
$\gamma^H(F)$, recursively connect a parent to its leftmost child, its
rightmost child, or to the child which roots the largest subtree,
respectively. Decomposing trees with root-leaf paths is
essential for the path strategies discussed in the next subsection.

Figure 4: Relevant subtrees and subforests.

The relevant subtrees of tree $F$ for some root-leaf path $\gamma$
are all subtrees that result from removing $\gamma$ from $F$, i.e., all
subtrees of $F$ that are connected to a node on the path $\gamma$.

Definition 2. The set of relevant subtrees of a tree $F$ with
respect to a root-leaf path $\gamma \in \gamma^*(F)$ is defined as $F - \gamma$.
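
The three path types and the relevant subtrees can be sketched in a few lines of Python. The nested-tuple encoding `(node_id, children)`, the function names, and the tie-breaking rule for equally large children on the heavy path (the leftmost one wins) are assumptions of this sketch, not part of the definitions above.

```python
def size(tree):
    """Number of nodes in a tree encoded as (node_id, children)."""
    return 1 + sum(size(child) for child in tree[1])

def root_leaf_path(tree, kind):
    """Node ids on the left ("L"), right ("R") or heavy ("H") path."""
    path = [tree[0]]
    while tree[1]:
        kids = tree[1]
        if kind == "L":
            tree = kids[0]
        elif kind == "R":
            tree = kids[-1]
        else:
            tree = max(kids, key=size)   # first largest child on ties
        path.append(tree[0])
    return path

def relevant_subtrees(tree, kind):
    """Subtrees whose root is off the path but whose parent is on it."""
    on_path = set(root_leaf_path(tree, kind))
    result = []

    def visit(t):
        for child in t[1]:
            if child[0] in on_path:
                visit(child)             # keep following the path
            else:
                result.append(child)     # hangs off the path

    visit(tree)
    return result

# Tree T of Figure 1: v1(v2, v5(v3), v4).
T = (1, ((2, ()), (5, ((3, ()),)), (4, ())))
print(root_leaf_path(T, "H"))            # [1, 5, 3]
print(relevant_subtrees(T, "L"))         # [(5, ((3, ()),)), (4, ())]
```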

The relevant subforests of F for some root-leaf path are 
F itself and all subforests obtained by removing nodes in 
the following order: (1) remove the root of F and stop if no 
nodes are left, (2) remove the leftmost root node in the re- 
sulting forest until the leftmost root node is on the root-leaf 
path, (3) remove the rightmost root node until the rightmost 
root node is on the root-leaf path, (4) recursively repeat the 
procedure for the resulting subtree. 

Definition 3. The set of relevant subforests of a tree $F$
with respect to a root-leaf path $\gamma \in \gamma^*(F)$ is recursively
defined as

$$\mathcal{F}(\emptyset, \gamma) = \emptyset$$

$$\mathcal{F}(F, \gamma) = \{F\} \cup \begin{cases} \mathcal{F}(F - r_R(F), \gamma) & \text{if } r_L(F) \in \gamma \\ \mathcal{F}(F - r_L(F), \gamma) & \text{otherwise} \end{cases}$$

Example 2. The leftmost tree in each dotted box in Figure 4
is a relevant subtree, and the nodes on the root-leaf
path are circled. The other forests in the box are the relevant
subforests with respect to the root-leaf path. The solid
lines connect trees and their relevant subtrees.

Recursive path decomposition. So far, we have considered
relevant subtrees and subforests with respect to a single
root-leaf path. If we assume that also for each of the resulting
relevant subtrees of a tree $F$ a root-leaf path is defined,
we can recursively continue the decomposition as follows:
(1) produce all relevant subforests of $F$ with respect to some
root-leaf path $\gamma$, (2) recursively apply this procedure to all
relevant subtrees of $F$ with respect to $\gamma$.

The set of paths that is used to recursively decompose $F$ is
called path partitioning of $F$ and is denoted as $\Gamma(F)$. A path
partitioning is a set of non-overlapping paths that all end in
a leaf node and cover the tree. We define the left path partitioning
of $F$ to be $\Gamma^L(F) = \{\gamma^L(F)\} \cup \bigcup_{F_v \in F - \gamma^L(F)} \Gamma^L(F_v)$;
the right path partitioning, $\Gamma^R(F)$, is defined analogously.

The set of relevant subforests that result from recursively
decomposing tree $F$ with a path partitioning $\Gamma = \Gamma(F)$ is

$$\mathcal{F}(F, \Gamma) = \mathcal{F}(F, \gamma_F) \cup \bigcup_{F' \in F - \gamma_F} \mathcal{F}(F', \Gamma), \qquad (1)$$

where $\gamma_F$ is the root-leaf path of $F$ in $\Gamma$, i.e., the only element
of $\gamma^*(F) \cap \Gamma$. The set of relevant subtrees that results from
recursively decomposing $F$ with the path partitioning $\Gamma$ is
$\mathcal{T}(F, \Gamma) = \{F\} \cup \bigcup_{F' \in F - \gamma_F} \mathcal{T}(F', \Gamma)$.
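
Definition 3 and Equation (1) translate fairly directly into code. The following Python sketch lists the relevant subforests for a single path and for the left path partitioning; trees are nested tuples `(node_id, children)` with unique ids, a path is passed as a set of node ids, and all function names are hypothetical.

```python
def relevant_subforests(tree, path):
    """Relevant subforests of Definition 3; `path` is the set of node ids
    on the root-leaf path."""
    result = []
    forest = (tree,)
    while forest:
        result.append(forest)
        if forest[0][0] in path:                    # leftmost root on path:
            forest = forest[:-1] + forest[-1][1]    #   remove rightmost root
        else:
            forest = forest[0][1] + forest[1:]      #   remove leftmost root
    return result

def left_path(tree):
    """Set of node ids on the left path of a tree."""
    ids = {tree[0]}
    while tree[1]:
        tree = tree[1][0]
        ids.add(tree[0])
    return ids

def left_decomposition(tree):
    """Equation (1) instantiated with the left path partitioning."""
    forests = relevant_subforests(tree, left_path(tree))
    node = tree
    while node[1]:
        for sub in node[1][1:]:          # subtrees hanging off the left path
            forests += left_decomposition(sub)
        node = node[1][0]
    return forests

# Tree T of Figure 1: v1(v2, v5(v3), v4).
T = (1, ((2, ()), (5, ((3, ()),)), (4, ())))
print(len(relevant_subforests(T, {1, 5, 3})))  # 5
print(len(left_decomposition(T)))              # 8
```

For this five-node tree, each of its three root-leaf paths yields five relevant subforests, and the left path partitioning yields eight.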

4.2 Path Strategies 

A decomposition strategy chooses between leftmost and 
rightmost root node at each recursive step, resulting in a set 
of relevant subforests, which is a subset of the full decom- 
position. We define a class of decomposition strategies that 
uses root-leaf paths to decompose trees. 

Definition 4. A path strategy $S$ for two trees $F$ and $G$
maps each pair of subtrees $(F_v, G_w)$, $v \in F$, $w \in G$, to a
root-leaf path in one of the subtrees, $\gamma^*(F_v) \cup \gamma^*(G_w)$. An
LRH strategy is a path strategy that maps subtree pairs to
left, right, and/or heavy paths.

Path strategies determine the choice of left vs. right at 
each step in the recursive tree edit distance solution. A sim- 
ple algorithm that uses a given path strategy in the recursive 
solution colors the nodes on the paths and works as follows: 
Initially, all nodes are white. If both $F$ and $G$ are trees
with white root nodes, color the nodes on path $S(F, G)$ (the
path to which strategy $S$ maps the pair of subtrees $(F, G)$) in
black, all other nodes in $F$ and $G$ are white. Let $B \in \{F, G\}$
be the tree/forest with a colored root node $b$. If the leftmost
root of $B$, $r_L(B)$, is black, then $v = r_R(F)$ and $w = r_R(G)$ in
the recursive formula, otherwise $v = r_L(F)$ and $w = r_L(G)$.

4.3 Quadratic Space Implementation 

Space-efficient implementations of the tree edit distance 
use a bottom-up approach, which computes distances be- 
tween smaller pairs of subtrees first. By carefully order- 
ing the subtree computations, storage space can be reused 
without violating the preconditions for later subtree compu- 
tations. The quadratic space algorithms presented in liter- 
ature are based on a distance matrix and a quadratic space 
function which fills the distance matrix. 

The distance matrix $D$ stores the distances between pairs
of subtrees of $F$ and $G$. We represent $D$ as a set of triples
$(F_v, G_w, d)$, where $d = \delta(F_v, G_w)$ is the tree edit distance
between $F_v$ and $G_w$, $v \in F$, $w \in G$; the transposed distance
matrix is defined as $D^T = \{(G_w, F_v, d) \mid (F_v, G_w, d) \in D\}$.

A single-path function, $\Delta(F, G, \gamma_F, D)$, computes distances
between the subtrees of two trees $F$ and $G$ for a
given root-leaf path $\gamma_F$ in $F$. More specifically, the single-path
function between two trees $F$ and $G$ computes the
distance between all subtrees of $F$ that are rooted in the
root-leaf path $\gamma_F$ and all subtrees of $G$. The precondition
is that the distances between all relevant subtrees of
$F$ with respect to $\gamma_F$ and all subtrees of $G$ are stored in
the distance matrix $D$,
$\{(F_v, G_w, d) \mid F_v \in F - \gamma_F, w \in G, d = \delta(F_v, G_w)\} \subseteq D$,
which is part of the input. The output is
$D_{out} = D \cup \{(F_v, G_w, d) \mid v \in \gamma_F, w \in G, d = \delta(F_v, G_w)\}$.

Two single-path functions with quadratic space complexity
have been proposed in literature. They differ in the types
of path (left, right, any) they can process and in the number
of subproblems they need to compute. The single-path function
by Demaine et al. [15] (called "compute period" in their
paper and computed for every node on the path) works for
any type of path and computes $|\mathcal{F}(F, \gamma_F)| \times |\mathcal{A}(G)|$ subproblems.
The single-path function by Zhang and Shasha [31]
(their "tree distance function" can be adapted to be a single-path
function) can only process left paths, but needs to compute
fewer subproblems, $|\mathcal{F}(F, \gamma_F)| \times |\mathcal{F}(G, \Gamma^L(G))|$, where
$\mathcal{F}(G, \Gamma^L(G)) \subseteq \mathcal{A}(G)$ is the set of subforests obtained by
recursively decomposing $G$ using only left root-leaf paths.
The symmetric version of this function processes only right
paths with similar complexity bounds. We denote the above
single-path functions with $\Delta^I$, $\Delta^L$, and $\Delta^R$, respectively.

4.4 A General Tree Edit Distance Algorithm 

The general tree edit distance algorithm (GTED, Algo- 
rithm 1) computes the tree edit distance for any path strat- 
egy S in quadratic space. The input is a pair of trees, F 
and G, the strategy S, and the distance matrix D, which 
initially is empty. The algorithm fills the distance matrix 
with the distance between all pairs of subtrees from F and 
G (including F and G themselves). 

GTED works as follows. It looks up the root-leaf path $\gamma$
in strategy $S$ for the pair of input trees $(F, G)$. If $\gamma$ is a path
in the left-hand tree $F$, then

• GTED is recursively called for tree $G$ and every relevant
subtree of tree $F$ with respect to path $\gamma$,

• the single-path function that corresponds to the type
of path $\gamma$ is called for the pair of input trees $F$ and $G$.

If $\gamma$ is a path in the right-hand tree $G$, then GTED is called
with the trees swapped. Conceptually, this requires transposing
the strategy $S$ and the distance matrix $D$; finally, also
the output must be transposed to get the original order of
$F$ and $G$ in the distance matrix. In the implementation, the
transposition is a flag that indicates the tree order.

The space complexity of GTED is quadratic. Both strategy
and distance matrix are of size $|F||G|$. The single-path
functions require $O(|F||G|)$ space and release the memory
when they terminate. Only one such function is active at any
time. The runtime complexity depends on the strategy
and can vary widely between $O(|F||G|)$ and $O(|F|^2|G|^2)$.
Below we briefly discuss the strategies of the most important 
tree edit distance algorithms presented in literature. 

4.5 Generalization of Previous Work 



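Algorithm 1: GTED($F$, $G$, $S$, $D$)

$\gamma \leftarrow S(F, G)$
if $\gamma \in \gamma^*(F)$ then
    for all $F' \in F - \gamma$ do
        $D \leftarrow D \cup \mathrm{GTED}(F', G, S, D)$
    if $\gamma = \gamma^L(F)$ then $D \leftarrow D \cup \Delta^L(F, G, \gamma, D)$
    else if $\gamma = \gamma^R(F)$ then $D \leftarrow D \cup \Delta^R(F, G, \gamma, D)$
    else $D \leftarrow D \cup \Delta^I(F, G, \gamma, D)$
else $D \leftarrow D \cup (\mathrm{GTED}(G, F, S^T, D^T))^T$
return $D$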



Our general tree edit distance algorithm, GTED, gener- 
alizes much of the previous work on the tree edit distance, 
i.e., most algorithms presented in literature are equivalent 
to GTED with a specific path strategy. We discuss the 
strategies that turn GTED into the algorithms of Zhang 
and Shasha [31], Klein [22], and Demaine et al. [15]. 

The strategy followed by Zhang and Shasha [31] maps all
pairs of subtrees $(F_v, G_w)$, $v \in F$, $w \in G$, to the left path,
$\gamma^L(F_v)$, resulting in an algorithm with runtime $O(n^4)$ in
the worst case. The symmetric strategy maps all pairs of
subtrees to the right path, $\gamma^R(F_v)$, with the same runtime
complexity. Klein [22] maps all pairs of subtrees to the heavy
path, $\gamma^H(F_v)$, achieving runtime $O(n^3 \log n)$.

The strategies discussed above use only the root-leaf paths
in $F$ for the decomposition. Demaine et al. [15] use both
trees and map all pairs of subtrees $(F_v, G_w)$, $v \in F$, $w \in G$,
to $\gamma^H(F_v)$ if $|F_v| \ge |G_w|$ and to $\gamma^H(G_w)$ otherwise. Thus,
in each recursive step, the larger tree is decomposed using
the heavy path. The resulting algorithm has runtime $O(n^3)$.
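
Viewed as code, each of these algorithms is a function from a subtree pair to a path choice. The Python sketch below encodes a choice as a pair (tree, path type), with "F" or "G" naming the tree and "L", "R" or "H" the path; this encoding, the function names, and the nested-tuple trees `(node_id, children)` are assumptions for illustration. The sketch only selects paths, it does not compute distances.

```python
def size(tree):
    """Number of nodes in a tree encoded as (node_id, children)."""
    return 1 + sum(size(child) for child in tree[1])

def zhang_shasha(Fv, Gw):
    return ("F", "L")        # always the left path of the left-hand subtree

def zhang_shasha_right(Fv, Gw):
    return ("F", "R")        # the symmetric strategy

def klein(Fv, Gw):
    return ("F", "H")        # always the heavy path of the left-hand subtree

def demaine(Fv, Gw):
    # heavy path of the larger subtree; ties go to Fv
    return ("F", "H") if size(Fv) >= size(Gw) else ("G", "H")

big = (1, ((2, ()), (5, ((3, ()),)), (4, ())))
small = (6, ())
print(demaine(big, small), demaine(small, big))  # ('F', 'H') ('G', 'H')
```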

5. COST OF PATH STRATEGIES 

In this section, we count the number of relevant subprob- 
lems that must be computed by a single-path function and 
the overall GTED algorithm. We develop a cost formula, 
which allows us to count the number of relevant subprob- 
lems of the optimal LRH strategy for any pair of input trees. 

5.1 Relevant Subproblems 

A relevant subproblem is a pair of relevant subforests
for which a distance must be computed during the recursive tree
edit distance evaluation. The number of relevant subproblems
depends on the decomposition strategy and determines
the runtime complexity of the respective algorithm. The set
of relevant subproblems is always a subset of $\mathcal{A}(F) \times \mathcal{A}(G)$.

5.2 Complexity of the Single-Path Functions 

We count the number of relevant subproblems that the
single-path functions $\Delta^L$, $\Delta^R$, and $\Delta^I$ compute for a pair
of trees, $F$ and $G$, and a given path $\gamma \in \gamma^*(F)$. We need
to derive the number of subforests in the full decomposition
of a tree, the decomposition with a single path, and the
recursive decomposition with a set of paths.

Lemma 1. The number of the subforests in the full path
decomposition of tree $F$ is $|\mathcal{A}(F)| = \frac{|F|(|F|+3)}{2} - \sum_{v \in F} |F_v|$.

Proof. See [16]. □
Lemma 2. The number of relevant subforests of tree $F$
with respect to a root-leaf path, $\gamma \in \gamma^*(F)$, is equal to the
number of nodes in $F$, $|\mathcal{F}(F, \gamma)| = |F|$.

Proof. The proof is by induction on the size of $F$. Basis:
For $|F| = 1$, $|\mathcal{F}(F, \gamma)| = |F|$ holds due to Definition 3 of
$\mathcal{F}(F, \gamma)$. Induction step: We assume that $|\mathcal{F}(F, \gamma)| = |F|$
holds for all trees $F_k$ of size $k$. Then, for a tree $F_{k+1}$ of size
$k + 1$ with Definition 3:
$|\mathcal{F}(F_{k+1}, \gamma)| = |\{F_{k+1}\} \cup \mathcal{F}(F_{k+1} - v, \gamma)| = 1 + |\mathcal{F}(F_{k+1} - v, \gamma)| = 1 + k$
since $|F_{k+1} - v| = k$. □

With the results of Lemmas 1 and 2, the cardinalities of
both $\mathcal{A}(F)$ and $\mathcal{F}(F, \gamma)$ are independent of the path used to
decompose tree F and can be computed in linear time in a 
single traversal of F for the tree and all its subtrees. 
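
One way to obtain the size of the full decomposition for all subtrees in a single postorder pass is sketched below in Python. It reads Lemma 1 as $|\mathcal{A}(F)| = |F|(|F|+3)/2 - \sum_{v \in F} |F_v|$; the tuple encoding `(node_id, children)` and the function name are assumptions.

```python
def full_decomposition_sizes(tree):
    """Return {node id: size of the full decomposition of the subtree
    rooted at that node}, computed with Lemma 1 in one traversal."""
    counts = {}

    def visit(t):
        # returns (|F_v|, sum of |F_x| over all nodes x in F_v)
        n, total = 1, 0
        for child in t[1]:
            child_n, child_total = visit(child)
            n += child_n
            total += child_total
        total += n                                  # add |F_v| itself
        counts[t[0]] = n * (n + 3) // 2 - total     # n*(n+3) is always even
        return n, total

    visit(tree)
    return counts

# Tree T of Figure 1: v1(v2, v5(v3), v4).
T = (1, ((2, ()), (5, ((3, ()),)), (4, ())))
print(full_decomposition_sizes(T)[1])  # 10
```

By Lemma 2 the number of relevant subforests for a single path is simply the subtree size, which the same traversal already computes as `n`.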

Lemma 3. The number of relevant subforests produced
by a recursive path decomposition of tree $F$ with path partitioning
$\Gamma$ is the sum of the sizes of all the relevant subtrees
in the recursive decomposition:

$$|\mathcal{F}(F, \Gamma)| = \sum_{F' \in \mathcal{T}(F, \Gamma)} |F'|$$

Proof. Follows from the definition of $\mathcal{F}(F, \Gamma)$ in Equation (1)
and Lemma 2. □

Example 3. Figure 4 shows different recursive path de- 
compositions. All path decompositions in the figure produce 
different numbers of relevant subforests, and they all produce
fewer subforests than the full decomposition in Figure 3.

The next lemma counts the number of relevant subprob- 
lems that the different single-path functions must compute. 

Lemma 4. The number of relevant subproblems computed
by the single-path functions $\Delta^I$, $\Delta^L$, and $\Delta^R$ for
a pair of trees $F$ and $G$ is as follows ($D$ is the distance
matrix):

• $\Delta^I(F, G, \gamma, D)$, $\gamma \in \gamma^*(F)$: $|F| \times |\mathcal{A}(G)|$

• $\Delta^L(F, G, \gamma^L(F), D)$: $|F| \times |\mathcal{F}(G, \Gamma^L(G))|$

• $\Delta^R(F, G, \gamma^R(F), D)$: $|F| \times |\mathcal{F}(G, \Gamma^R(G))|$

Proof. Demaine et al. [15] show that the number of
subproblems computed by $\Delta^I(F, G, \gamma, D)$, $\gamma \in \gamma^*(F)$, is
$|\mathcal{F}(F, \gamma)| \times |\mathcal{A}(G)|$. Zhang and Shasha [31] show the number
of relevant subproblems for $\Delta^L(F, G, \gamma^L(F), D)$ to be
$|\mathcal{F}(F, \gamma^L(F))| \times |\mathcal{F}(G, \Gamma^L(G))|$; $|\mathcal{F}(F, \gamma^L(F))| = |F|$ follows
from Lemma 2. The same rationale holds for $\Delta^R$. □

5.3 Cost of the Optimal Strategy 

With the results in Section 5.2 we can compute the cost 
of GTED for any path strategy, i.e., the number of relevant 
subproblems that GTED must compute. The overall cost 
of a strategy is computed by decomposing a tree into its 
relevant subtrees according to the paths in the strategy and 
summing up the costs for executing the single-path function 
for each pair of relevant subtrees. 

We compute the cost of the optimal LRH strategy. This is
achieved by an exhaustive search in the space of all possible
LRH strategies for a given pair of trees. At each recursive
step, there are six choices: either $F$ or $G$ is decomposed by
its left, right, or heavy path. For
each of the six choices, the cost of the relevant subtrees that
result from the decomposition must be explored. The optimal
strategy is computed by the formula in Figure 5, which
chooses the minimum cost at each recursive step. The cost
formula counts the exact number of relevant subproblems
for the optimal LRH strategy between two trees.

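$$\mathrm{cost}(F, G) = \min \begin{cases}
|F| \times |\mathcal{A}(G)| + \sum_{F' \in F - \gamma^H(F)} \mathrm{cost}(F', G) \\
|G| \times |\mathcal{A}(F)| + \sum_{G' \in G - \gamma^H(G)} \mathrm{cost}(G', F) \\
|F| \times |\mathcal{F}(G, \Gamma^L(G))| + \sum_{F' \in F - \gamma^L(F)} \mathrm{cost}(F', G) \\
|G| \times |\mathcal{F}(F, \Gamma^L(F))| + \sum_{G' \in G - \gamma^L(G)} \mathrm{cost}(G', F) \\
|F| \times |\mathcal{F}(G, \Gamma^R(G))| + \sum_{F' \in F - \gamma^R(F)} \mathrm{cost}(F', G) \\
|G| \times |\mathcal{F}(F, \Gamma^R(F))| + \sum_{G' \in G - \gamma^R(G)} \mathrm{cost}(G', F)
\end{cases}$$

Figure 5: The cost formula.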



Theorem 1. The cost formula in Figure 5 computes the 
cost of the optimal LRH strategy for the general tree edit 
distance algorithm GTED. 

Proof. Proof by induction over the structure of $F$ and
$G$. In the base case, both $F$ and $G$ consist of a single node
and $\mathrm{cost}(F, G) = 1$. Inductive hypothesis: $\mathrm{cost}(F', G)$ and
$\mathrm{cost}(F, G')$ for the relevant subtrees $F'$ and $G'$ with respect
to the left, right, and heavy path of $F$ and $G$ are optimal.
We show that the optimality of $\mathrm{cost}(F, G)$ follows. There
are six possible paths for decomposing $F$ and $G$: left, right,
or heavy in either $F$ or $G$. If $\gamma$ is a path in $F$, the distance
of $G$ to all relevant subtrees of $F$ with respect to $\gamma$ must be
computed, which can be done in $\sum_{F' \in F - \gamma} \mathrm{cost}(F', G)$ steps
(inductive hypothesis). Further, depending on the path type
of $\gamma$, a single-path function must be called: $\Delta^L$ for the left
path, $\Delta^R$ for the right path, and $\Delta^I$ for the heavy path.
The costs follow from Lemma 4 and are added to the cost
sum for the relevant subtrees. Calling $\Delta^I$ for left or right
paths cannot lead to smaller cost since $\mathcal{F}(F, \Gamma^{L/R}) \subseteq \mathcal{A}(F)$.
A similar rationale holds if $\gamma$ is a path in $G$. The cost formula
in Figure 5 takes the minimum of the six possible costs. □

The single-path function, $\Delta^I$, which is used for heavy
paths in GTED, is not optimal since it computes subproblems
that are not required. It is, however, the only known
algorithm for heavy paths that runs in $O(n^2)$ space. $\Delta^I$ can
easily be substituted with another single-path function by
changing the respective costs in the cost formula.

6. RTED: ROBUST TREE EDIT DISTANCE ALGORITHM

The robust tree edit distance algorithm, RTED, computes 
the optimal LRH strategy for two trees and runs the GTED 
algorithm presented in Section 4 with the optimal strategy. 
The optimal strategy for two trees, F and G, is computed
with the cost formula in Figure 5 by memorizing at each
recursive step the best root-leaf path. Since there is one path
for each pair of relevant subtrees, the paths can be stored
in an array of size $|F| \times |G|$, the strategy array, which we
initialize with empty paths. The strategy array maps each
pair of subtrees to a root-leaf path in one of the subtrees,
thus the strategy array is a path strategy (cf. Definition 4).

The key challenge is to compute the optimal strategy ef- 
ficiently. The exhaustive exploration of the search space of 
exponential size is obviously prohibitive. We observe, however,
that the cost is computed only between pairs of subtrees
of F and G, and there is only a quadratic number of
such pairs. This suggests a dynamic programming approach,
which stores the intermediate results and traverses
identical branches of the search tree only once.

6.1 Baseline Algorithm for Optimal Strategy 

The baseline algorithm for computing the optimal strat- 
egy implements the cost formula and traces back the optimal 
strategy. It stores intermediate results and uses dynamic 
programming to avoid computing the strategy for identical 
pairs of subtrees more than once. The intermediate results 
are stored in a memoization matrix of size $|F| \times |G|$, and
the optimal strategy is stored in the strategy array. At each 
recursive step, four actions are performed for F and G: 

• If the cost for F and G is in the memoization matrix, 
return the cost and ignore the following steps. 

• Compute the costs for G (F) and the relevant subtrees 
of F (G) w.r.t. the left, right, and heavy paths; sum 
up the values and compute a cost for each path. 

• Store the path with smallest cost as the entry for F 
and G in the strategy array. 

• Store the cost for F and G in the memoization matrix. 
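
The four steps above can be sketched in Python as a memoized recursion over node pairs. This is a minimal illustration, not the authors' implementation. It assumes trees given as nested lists (a node is the list of its children, a leaf is `[]`), and it assumes closed forms for the factors $|\mathcal{A}(\cdot)|$, $|\mathcal{F}(\cdot,\Gamma^L)|$, and $|\mathcal{F}(\cdot,\Gamma^R)|$ that are consistent with the values used in Example 4; if Lemmas 1 and 3 state them differently, `factors` must be replaced. All function names are hypothetical.

```python
def index(tree):
    """Flatten a nested-list tree into postorder arrays (0-based node ids)."""
    kids, size = [], []

    def visit(node):
        ids = [visit(child) for child in node]
        size.append(1 + sum(size[c] for c in ids))
        kids.append(ids)
        return len(size) - 1

    visit(tree)
    return kids, size


def factors(kids, size):
    """ASSUMED closed forms for |A(.)|, |F(., Gamma^L)|, |F(., Gamma^R)| per subtree."""
    n = len(size)
    total, full, left, right = [0] * n, [0] * n, [0] * n, [0] * n
    for v in range(n):  # postorder: children come before their parent
        total[v] = size[v] + sum(total[c] for c in kids[v])
        full[v] = size[v] * (size[v] + 3) // 2 - total[v]
        left[v] = size[v] + sum(left[c] for c in kids[v])
        right[v] = size[v] + sum(right[c] for c in kids[v])
        if kids[v]:
            left[v] -= size[kids[v][0]]    # leftmost child lies on the left path
            right[v] -= size[kids[v][-1]]  # rightmost child lies on the right path
    return {"H": full, "L": left, "R": right}


def next_on_path(children, size, kind):
    """Child that continues a left (L), right (R), or heavy (H) path."""
    if kind == "L":
        return children[0]
    if kind == "R":
        return children[-1]
    return max(children, key=lambda c: size[c])  # ties: leftmost child (assumption)


def relevant(kids, size, v, kind):
    """Roots of the relevant subtrees of the subtree at v w.r.t. the given path."""
    roots = []
    while kids[v]:
        on_path = next_on_path(kids[v], size, kind)
        roots += [c for c in kids[v] if c != on_path]
        v = on_path
    return roots


def baseline_strategy(F, G):
    """Return (cost of the root pair, strategy dict keyed by node pairs)."""
    kF, sF = index(F)
    kG, sG = index(G)
    fF, fG = factors(kF, sF), factors(kG, sG)
    memo, strategy = {}, {}

    def cost(v, w):
        if (v, w) in memo:                 # step 1: reuse a stored cost
            return memo[v, w]
        options = []
        for kind in "HLR":                 # step 2: one cost per candidate path
            in_f = sF[v] * fG[kind][w] + sum(
                cost(x, w) for x in relevant(kF, sF, v, kind))
            in_g = sG[w] * fF[kind][v] + sum(
                cost(v, y) for y in relevant(kG, sG, w, kind))
            options += [(in_f, ("F", kind)), (in_g, ("G", kind))]
        best, path = min(options, key=lambda o: o[0])
        strategy[v, w] = path              # step 3: remember the cheapest path
        memo[v, w] = best                  # step 4: memoize the cost
        return best

    return cost(len(sF) - 1, len(sG) - 1), strategy
```

For the two trees of Example 4, `baseline_strategy([[], []], [[]])` yields a root cost of 8, and two single-node trees yield the base-case cost of 1. The repeated summation over relevant subtrees inside `cost` is what makes this variant cubic.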

Theorem 2. The runtime complexity of the baseline algorithm
for two trees F and G is bounded by $O(n^3)$,
$n = \max(|F|, |G|)$, and the bound is tight.

Proof. The runtime of the baseline algorithm is determined
by the number of sums that must be computed. We
proceed in three steps: (1) count the number of summations,
(2) show the $O(n^3)$ upper bound for the complexity,
(3) give an instance for which this bound is tight.

(1) Summation count: The cost for a subtree pair $(F_v, G_w)$
is the minimum of six values (cf. Figure 5). Computing each
value requires $|F - \gamma|$ summations: The cost of the single-path
function (product) is computed in constant time since
the factors can be precomputed in $O(|F| + |G|)$ time with
Lemmas 1 and 3; adding the costs for the relevant subtrees
w.r.t. path $\gamma$ requires $|F - \gamma| - 1$ summations. Since we store
the results, the cost for each pair $(F_v, G_w)$, $v \in F$, $w \in G$,
must be computed at most once. For some subtree pairs no
cost might be computed, for example, when the root node
of one of the subtrees is the only child of its parent. Overall,
this leads to an upper bound for the number of summations:

\[
\#sums \le \sum_{v \in F,\, w \in G} \Big(
|F_v - \gamma^L(F_v)| + |F_v - \gamma^R(F_v)| + |F_v - \gamma^H(F_v)|
+ |G_w - \gamma^L(G_w)| + |G_w - \gamma^R(G_w)| + |G_w - \gamma^H(G_w)| \Big)
\tag{2}
\]



(2) Upper bound: The number of relevant subtrees of $F_v$
with respect to $\gamma \in \gamma^*(F_v)$ is limited by the number of leaf
nodes of $F_v$, $|F_v - \gamma| \le |l(F_v)| - 1$. Substituting in (2) we get
a new (looser) upper bound,
$\#sums \le \sum_{v \in F,\, w \in G} \big(3(|l(F_v)| - 1) + 3(|l(G_w)| - 1)\big)$.
There are at most $|F||G|$ different pairs
of subtrees, $|l(F_v)| < |F|$, $|l(G_w)| < |G|$, thus
$\#sums < |F||G|(3|F| + 3|G|) = O(n^3)$.

(3) Tightness of bound: We show that for some instances
the runtime of the baseline algorithm is $\Omega(n^3)$. Let F be
a left branch tree (Figure 7(a)) and G a right branch tree
(Figure 7(b)). For a subtree, $F_v$, rooted in a non-leaf v of F,
it holds that
$|F_v - \gamma^L(F_v)| = |F_v - \gamma^H(F_v)| = \frac{|F_v| - 1}{2}$,
$|F_v - \gamma^R(F_v)| = 1$; if v is a leaf, then $|F_v - \gamma| = 0$ for any path.
Similarly, for subtree $G_w$ and non-leaf w, $|G_w - \gamma^L(G_w)| = 1$,
$|G_w - \gamma^H(G_w)| = |G_w - \gamma^R(G_w)| = \frac{|G_w| - 1}{2}$;
$|G_w - \gamma| = 0$ if w is a leaf. For F and G, every pair of
subtrees occurs during the computation of the baseline algorithm,
thus the right-hand term in (2) is the exact number
of summations. By substituting in (2) we get:

\[
\begin{aligned}
\#sums ={} & \sum_{v \in F \setminus l(F),\, w \in G \setminus l(G)}
\Big( \tfrac{|F_v|-1}{2} + 1 + \tfrac{|F_v|-1}{2} + 1 + \tfrac{|G_w|-1}{2} + \tfrac{|G_w|-1}{2} \Big) \\
& + \sum_{v \in l(F),\, w \in G \setminus l(G)}
\Big( 1 + \tfrac{|G_w|-1}{2} + \tfrac{|G_w|-1}{2} \Big) \\
& + \sum_{v \in F \setminus l(F),\, w \in l(G)}
\Big( 1 + \tfrac{|F_v|-1}{2} + \tfrac{|F_v|-1}{2} \Big)
\end{aligned}
\]

$l(F)$ and $l(G)$ are the leaf nodes of F and G, respectively. With
$|l(F)| = (|F| + 1)/2$, $|l(G)| = (|G| + 1)/2$, we get
$\#sums = \Omega(|F||G|(|F| + |G|) + |F||G||G| + |F||G||F|) = \Omega(n^3)$. □
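
The path counts used in the tightness argument can be checked on a small instance. The sketch below assumes that a left branch tree is a binary tree in which every inner node has a left child that continues the spine and a right child that is a leaf; this shape is inferred from the stated counts and is an assumption about Figure 7(a). Trees are nested lists (a leaf is `[]`) and all helper names are hypothetical.

```python
def left_branch(inner_nodes):
    """Assumed left branch shape: each inner node = [spine subtree, leaf]."""
    tree = []
    for _ in range(inner_nodes):
        tree = [tree, []]
    return tree


def size(tree):
    """Number of nodes in the tree."""
    return 1 + sum(size(child) for child in tree)


def leaves(tree):
    """Number of leaf nodes in the tree."""
    return 1 if not tree else sum(leaves(child) for child in tree)


def relevant_subtrees(tree, choose):
    """Number of subtrees hanging off the root-leaf path selected by `choose`."""
    count = 0
    while tree:
        count += len(tree) - 1  # all children except the one on the path
        tree = choose(tree)
    return count


PATHS = {
    "left": lambda t: t[0],
    "right": lambda t: t[-1],
    "heavy": lambda t: max(t, key=size),  # ties: leftmost child (assumption)
}

F = left_branch(5)
counts = {name: relevant_subtrees(F, choose) for name, choose in PATHS.items()}
print(size(F), leaves(F), counts)
```

With five inner nodes the tree has 11 nodes and 6 = (11 + 1)/2 leaves; the left and heavy paths each leave (11 - 1)/2 = 5 relevant subtrees, while the right path leaves a single one.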



6.2 Efficient Algorithm for Optimal Strategy 

The runtime of the baseline algorithm is $O(n^3)$. While
this is clearly a major improvement over the naive exponential
solution, it is unfortunately not enough in our application.
The complexity for computing the strategy must not
be higher than the complexity of the optimal strategy for
GTED. The optimal GTED strategy is often better than
cubic, e.g., $O(n^2 \log^2 n)$ for trees of depth $\log(n)$.

In this section we introduce OptStrategy (Algorithm 2),
which computes the optimal LRH strategy for GTED in
$O(n^2)$ time. Similar to the baseline algorithm, a strategy
array, STR, of quadratic size is used to store the best path
at each recursive step, resulting in the best overall strategy.

Different from the baseline algorithm, we do not store
costs between individual pairs of relevant subtrees. Instead
we maintain and incrementally update the cost sums of the
relevant subtrees (summations over the relevant subtrees in
the cost formula). We do not sum up the same cost multiple
times and thus reduce the runtime. The cost sums are
stored in cost arrays: $L_v$, $R_v$, $H_v$ of size $|F| \times |G|$ store
a cost sum for each pair $(F_v, G_w)$, $v \in F$, $w \in G$, for the
left, right, and heavy path in $F_v$, respectively; for example,
$L_v[F_v, G_w] = \sum_{F' \in F_v - \gamma^L(F_v)} \mathrm{cost}(F', G_w)$.
$L_w$, $R_w$, $H_w$ of size $|G|$ store the cost sums between all
relevant subtrees of G w.r.t. some path and a specific subtree $F_v$,
for example,
$L_w[G_w] = \sum_{G' \in G_w - \gamma^L(G_w)} \mathrm{cost}(G', F_v)$.
Algorithm 2: OptStrategy(F, G)

 1  $L_v, R_v, H_v$: arrays of size $|F| \times |G|$
 2  $L_w, R_w, H_w$: arrays of size $|G|$
 3  for v = 1 to |F| in postorder do
 4    for w = 1 to |G| in postorder do
 5      if w is leaf then $L_w[w] \leftarrow R_w[w] \leftarrow H_w[w] \leftarrow 0$
 6      if v is leaf then $L_v[v,w] \leftarrow R_v[v,w] \leftarrow H_v[v,w] \leftarrow 0$
 7      $C \leftarrow \{(|F_v| \times |\mathcal{A}(G_w)| + H_v[v,w],\ \gamma^H(F)),$
 8             $(|G_w| \times |\mathcal{A}(F_v)| + H_w[w],\ \gamma^H(G)),$
 9             $(|F_v| \times |\mathcal{F}(G_w,\Gamma^L(G))| + L_v[v,w],\ \gamma^L(F)),$
10             $(|G_w| \times |\mathcal{F}(F_v,\Gamma^L(F))| + L_w[w],\ \gamma^L(G)),$
11             $(|F_v| \times |\mathcal{F}(G_w,\Gamma^R(G))| + R_v[v,w],\ \gamma^R(F)),$
12             $(|G_w| \times |\mathcal{F}(F_v,\Gamma^R(F))| + R_w[w],\ \gamma^R(G))\}$
13      $(c_{min}, \gamma_{min}) \leftarrow (c, \gamma)$ such that $(c, \gamma) \in C$ and
          $c = \min\{c' \mid (c', \gamma') \in C\}$
14      $STR[v,w] \leftarrow \gamma_{min}$
15      if v is not root then
16        $L_v[p(v),w]$ += $L_v[v,w]$ if $v \in \gamma^L(F_{p(v)})$, $c_{min}$ otherwise
17        $R_v[p(v),w]$ += $R_v[v,w]$ if $v \in \gamma^R(F_{p(v)})$, $c_{min}$ otherwise
18        $H_v[p(v),w]$ += $H_v[v,w]$ if $v \in \gamma^H(F_{p(v)})$, $c_{min}$ otherwise
19      if w is not root then
20        $L_w[p(w)]$ += $L_w[w]$ if $w \in \gamma^L(G_{p(w)})$, $c_{min}$ otherwise
21        $R_w[p(w)]$ += $R_w[w]$ if $w \in \gamma^R(G_{p(w)})$, $c_{min}$ otherwise
22        $H_w[p(w)]$ += $H_w[w]$ if $w \in \gamma^H(G_{p(w)})$, $c_{min}$ otherwise
23  return STR




The algorithm loops over every pair of subtrees $(F_v, G_w)$
in postorder of the nodes $v \in F$, $w \in G$. The cost sum
for a leaf node is zero, because leaves have no relevant subtrees
(Lines 5-6). The values in the cost arrays are used to
compute the costs of the pair $(F_v, G_w)$ for each of the six
possible paths in the cost formula (Lines 7-12). For each
result a (cost, path) pair is stored in the temporary set C.
The minimum cost in C is assigned to $c_{min}$, the respective
path $\gamma_{min}$ is stored in the strategy array. Finally, the cost
sums for the subtree pairs $(F_{p(v)}, G_w)$ and $(F_v, G_{p(w)})$ are
updated, where p(v) and p(w) are the parents of v and w,
respectively. The update value depends on whether v and
w belong to the same path as their parent.
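
The loop structure described above can be sketched in Python as follows. This is an illustrative reading of Algorithm 2, not the authors' code. Trees are nested lists (a leaf is `[]`), node ids are 0-based postorder positions, and the size factors use assumed closed forms that agree with the values in Example 4. Two details are assumptions of this sketch: the one-dimensional arrays for G are re-allocated for every v so that sums for different $F_v$ do not mix, and the cost matrix is returned alongside STR for inspection.

```python
def index(tree):
    """Postorder arrays (0-based ids): parent, children, subtree size."""
    parent, kids, size = [], [], []

    def visit(node):
        ids = [visit(child) for child in node]
        v = len(size)
        size.append(1 + sum(size[c] for c in ids))
        kids.append(ids)
        parent.append(None)
        for c in ids:
            parent[c] = v
        return v

    visit(tree)
    return parent, kids, size


def factors(kids, size):
    """ASSUMED closed forms for |A(.)|, |F(., Gamma^L)|, |F(., Gamma^R)|."""
    n = len(size)
    total, full, left, right = [0] * n, [0] * n, [0] * n, [0] * n
    for v in range(n):  # postorder: children come before their parent
        total[v] = size[v] + sum(total[c] for c in kids[v])
        full[v] = size[v] * (size[v] + 3) // 2 - total[v]
        left[v] = size[v] + sum(left[c] for c in kids[v])
        right[v] = size[v] + sum(right[c] for c in kids[v])
        if kids[v]:
            left[v] -= size[kids[v][0]]
            right[v] -= size[kids[v][-1]]
    return full, left, right


def opt_strategy(F, G):
    """Return (STR, cost); STR[v][w] is a (tree, path type) pair."""
    pF, kF, sF = index(F)
    pG, kG, sG = index(G)
    aF, lF, rF = factors(kF, sF)
    aG, lG, rG = factors(kG, sG)
    nF, nG = len(sF), len(sG)
    # Heavy child per node; ties go to the leftmost child (assumption).
    hF = [max(k, key=lambda c: sF[c]) if k else None for k in kF]
    hG = [max(k, key=lambda c: sG[c]) if k else None for k in kG]
    Lv = [[0] * nG for _ in range(nF)]
    Rv = [[0] * nG for _ in range(nF)]
    Hv = [[0] * nG for _ in range(nF)]
    STR = [[None] * nG for _ in range(nF)]
    cost = [[0] * nG for _ in range(nF)]
    for v in range(nF):
        Lw, Rw, Hw = [0] * nG, [0] * nG, [0] * nG  # fresh sums for this F_v
        for w in range(nG):
            C = [
                (sF[v] * aG[w] + Hv[v][w], ("F", "H")),
                (sG[w] * aF[v] + Hw[w], ("G", "H")),
                (sF[v] * lG[w] + Lv[v][w], ("F", "L")),
                (sG[w] * lF[v] + Lw[w], ("G", "L")),
                (sF[v] * rG[w] + Rv[v][w], ("F", "R")),
                (sG[w] * rF[v] + Rw[w], ("G", "R")),
            ]
            c_min, STR[v][w] = min(C, key=lambda entry: entry[0])
            cost[v][w] = c_min
            p, q = pF[v], pG[w]
            if p is not None:  # push sums to the parent of v
                Lv[p][w] += Lv[v][w] if v == kF[p][0] else c_min
                Rv[p][w] += Rv[v][w] if v == kF[p][-1] else c_min
                Hv[p][w] += Hv[v][w] if v == hF[p] else c_min
            if q is not None:  # push sums to the parent of w
                Lw[q] += Lw[w] if w == kG[q][0] else c_min
                Rw[q] += Rw[w] if w == kG[q][-1] else c_min
                Hw[q] += Hw[w] if w == hG[q] else c_min
    return STR, cost
```

For the trees of Example 4, `opt_strategy([[], []], [[]])` gives a cost of 8 for the root pair and selects the heavy path in F, since `min` keeps the first of several equal candidates. The inner loop body does a constant amount of work per node pair, which is the source of the quadratic bound.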

Example 4. We use Algorithm 2 to compute the optimal
strategy for two trees F and G, N(F) = {1,2,3},
E(F) = {(3,1), (3,2)}, N(G) = {1,2}, E(G) = {(2,1)};
for simplicity, node IDs and labels are identical and correspond
to the postorder position in the tree. Figure 6
shows the cost arrays and the strategy array before the last
node pair, v = 3, w = 2, is processed. Array rows and
columns are labeled with node IDs, e.g., the cost sum for
the subtree pair $(F_3, G_1)$ w.r.t. the heavy path in $F_3$ is
$H_v[3,1] = 1$. We compute the missing value in the strategy
array STR: Neither v nor w is a leaf; with
$|G_w| = |\mathcal{A}(G_w)| = |\mathcal{F}(G_w,\Gamma^L(G))| = |\mathcal{F}(G_w,\Gamma^R(G))| = 2$,
$|F_v| = 3$,
$|\mathcal{A}(F_v)| = |\mathcal{F}(F_v,\Gamma^L(F))| = |\mathcal{F}(F_v,\Gamma^R(F))| = 4$ we
compute
$C = \{(3 \cdot 2 + 2, \gamma^H(F_3)), (2 \cdot 4 + 0, \gamma^H(G_2)),
(3 \cdot 2 + 2, \gamma^L(F_3)), (2 \cdot 4 + 0, \gamma^L(G_2)),
(3 \cdot 2 + 2, \gamma^R(F_3)), (2 \cdot 4 + 0, \gamma^R(G_2))\}
= \{(8, \gamma^H(F_3)), (8, \gamma^H(G_2)), (8, \gamma^L(F_3)),
(8, \gamma^L(G_2)), (8, \gamma^R(F_3)), (8, \gamma^R(G_2))\}$.
In the example, all costs are identical, and we arbitrarily pick
$(c_{min}, \gamma_{min}) = (8, \gamma^H(F_3))$ as the minimum; the missing
value in the strategy array is $\gamma^H(F_3)$. Since both v and w
are roots, the algorithm terminates and returns the optimal strategy.
Figure 6: Cost arrays and strategy array.

Theorem 3. Algorithm 2 is correct.

Proof. The strategy array STR maps every pair of subtrees
to a root-leaf path and thus is a strategy according to
Definition 4. Next we show by induction that the cost arrays
store the correct values according to the cost formula.
Base case: For pairs of leaf nodes (v, w) the cost arrays store
zeros; this is correct due to
$F_v - \gamma_{F_v} = G_w - \gamma_{G_w} = \emptyset$. Inductive
hypothesis: For all pairs of children (v, w) of two nodes,
f = p(v) and g = p(w), the values in the cost arrays are correct.
We show that, after processing all children, the values
for f and g are correct. This implies the correctness of the
overall algorithm since the nodes are processed in postorder,
i.e., all children are processed before their parent.

We consider two cases for node v (for w analogous reasonings
hold): (1) node v lies on the same root-leaf path
as its parent, (2) node v does not lie on the same root-leaf
path as its parent. Case 1: $F_v$ is not a relevant subtree of
$F_f$ with respect to $\gamma_{F_f}$. The cost already stored for v is the
sum of the costs for every relevant subtree $F' \in F_v - \gamma_{F_v}$,
i.e., a part of the sum
$\sum_{F' \in F_f - \gamma_{F_f}} \mathrm{cost}(F', G_w)$. We increment
the value in the cost array of f with the cost already
stored for v. Case 2: $F_v \in F_f - \gamma_{F_f}$, i.e., v is the root of
a relevant subtree of $F_f$ with respect to $\gamma_{F_f}$. The cost of v is
an element in the cost sum for f,
$\sum_{F' \in F_f - \gamma_{F_f}} \mathrm{cost}(F', G_w)$.
We add the cost value of v ($c_{min}$) to the cost entry of f. □

Theorem 4. Time and space complexity of Algorithm 2
are $O(n^2)$, where $n = \max(|F|, |G|)$.

Proof. Algorithm 2 iterates over all pairs of subtrees
$(F_v, G_w)$, $v \in F$, $w \in G$, thus the innermost loop is executed
$|F||G|$ times. In the inner loop we do a constant number of
array lookups and sums. The factors of the six products in
Lines 7-12 are precomputed in $O(|F| + |G|)$ time and space
using the Lemmas in Section 5.2. We use four arrays of
size $|F| \times |G|$ and three arrays of size $|G|$, thus the overall
complexity is $O(|F||G|)$ in time and space. □
7. RELATED WORK

The first tree edit distance algorithm for the unrestricted
edit model, which is also assumed in our work, was already
proposed in 1979 by Tai [28]. Tai's algorithm runs
in $O(m^3 n^3)$ time and space for two trees F and G with m
and n nodes, respectively. Zhang and Shasha [31] improve
the complexity to $O(m^2 n^2)$ time and $O(mn)$ space; for trees
with l leaves and depth d the runtime is
$O(mn \min(l_F, d_F) \min(l_G, d_G))$, which is much better than
$O(m^2 n^2)$ for some tree shapes. Klein [22] uses heavy paths [27]
to decompose the larger tree and gets an $O(n^2 m \log m)$ time
and space algorithm, $m \ge n$. Demaine et al. [15] also use heavy
paths, but different from Klein they switch the trees such that the
larger subtree is decomposed in each recursive step. Their
algorithm runs in $O(n^2 m (1 + \log \frac{m}{n}))$ time and in $O(mn)$
space, $m \ge n$. Although Demaine's algorithm is worst-case
optimal [15], it is slower than Zhang's algorithm for some
interesting tree shapes, for example, balanced trees. Our
RTED algorithm is efficient for all tree shapes for which any
of the above algorithms is efficient. The runtime of RTED
is worst-case optimal and the space complexity is $O(mn)$.

Similar to our approach, Dulucq and Touzet [16] compute
a decomposition strategy in the first step, then use the strategy
to compute the tree edit distance. They only consider
strategies that decompose a single tree and get an algorithm
that runs in $O(n^2 m \log m)$ time and space. Our algorithm
requires only $O(mn)$ space. Further, we consider strategies
that decompose both trees, which adds complexity to the
model, but is required to achieve worst-case optimality: our
runtime is $O(n^2 m (1 + \log \frac{m}{n}))$, which is $O(n^3)$ if n = m.

None of the above works includes an empirical evaluation 
of the algorithms. We evaluate our RTED algorithm on 
both synthetic and real world data and compare it to the 
solutions of Zhang and Shasha [31], Klein [22], and Demaine 
et al. [15]. We further provide a formal framework, which 
extends previous work by Dulucq and Touzet [16], and an 
algorithm that generalizes all above approaches. 

For specific tree shapes or a restricted set of edit operations
faster algorithms have been proposed. Chen [11]
presents an $O(mn + l_F^2 n + l_F^{2.5} l_G) = O(m^{2.5} n)$ algorithm
based on fast matrix multiplication, which is efficient for
some instances, e.g., when one of the trees has few leaves.
Chen and Zhang [10] present an efficient algorithm for trees
with long chains. By contracting the chains they reduce the
size of tree F (G) from m to $\tilde{m}$ (n to $\tilde{n}$) and achieve runtime
$O(mn + \tilde{m}^2 \tilde{n}^2)$. Chawathe [8] restricts the delete operation
to leaf nodes. An external memory algorithm that reduces 
the tree edit distance problem to a well-studied shortest path 
problem is proposed. Other variants of the tree edit distance 
are discussed in a survey by Bille [7] . Our algorithm adapts 
to any tree shape and we assume the unrestricted edit model. 

Tree edit distance variants are also used for change detec- 
tion in hierarchical data. Lee et al. [24] and Chawathe et 
al. [9] match tree nodes and compute a distance in O(ne) 
time, where e is the distance between the trees. Cobena et 
al. [12] take advantage of element IDs in XML documents, 
which cannot be generally assumed. The X-Diff algorithm 
by Wang et al. [29] allows leaf and subtree insertion and deletion,
and node renaming. In order to achieve $O(n^2 \times f \log f)$
time complexity for trees with n nodes and maximum fanout
f, only nodes with the same path to the root are matched.

Lower and upper bounds of the tree edit distance have 
been studied. Guha et al. [18] propose a lower bound based
on the string edit distance between serialized trees and an
upper bound based on a restricted tree edit distance vari- 
ant. Yang et al. [30] decompose trees into so-called binary 
branches and derive a lower bound from the number of 
non-matching binary branches between two trees. Similarly, 
Augsten et al. [4, 5] decompose the trees into pq-grams and 
provide a lower bound for an edit distance that gives higher 
weight to nodes with many children. The bounds have more 
efficient algorithms than the exact tree edit distance that 
we compute in this paper. Bounds are useful to prune exact 
distance computations when trees are matched with a sim- 
ilarity threshold. There is no straightforward way to build 
bounds into the dynamic programming algorithms for the 
exact tree edit distance to improve its performance. 
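
As an illustration of how a bound prunes exact computations under a similarity threshold, the following Python sketch applies a filter-and-verify scheme. It is not taken from the cited works. It assumes unit edit costs, under which the size difference of two trees is a simple lower bound on their edit distance (each insert or delete changes the size by one). The `exact_distance` parameter is a hypothetical callable standing in for any exact tree edit distance implementation; trees are nested lists.

```python
from itertools import combinations


def size(tree):
    """Number of nodes of a nested-list tree (a node is the list of its children)."""
    return 1 + sum(size(child) for child in tree)


def threshold_join(trees, tau, exact_distance):
    """Self join: index pairs (i, j) with exact_distance < tau, plus #exact calls."""
    sizes = [size(t) for t in trees]
    matches, exact_calls = [], 0
    for i, j in combinations(range(len(trees)), 2):
        # ASSUMPTION: unit costs, so |size difference| <= edit distance.
        # If the lower bound already reaches tau, the pair cannot match.
        if abs(sizes[i] - sizes[j]) >= tau:
            continue
        exact_calls += 1
        if exact_distance(trees[i], trees[j]) < tau:
            matches.append((i, j))
    return matches, exact_calls
```

The filter never discards a true match as long as the bound is valid for the cost model in use; tighter bounds such as those cited above would skip more pairs at a higher filtering cost.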

The TASM (top-k approximate subtree matching) algorithm
by Augsten et al. [2] identifies the top-k subtrees in
a data tree with the smallest edit distances from a given
query tree. TASM prunes distance computations for large
subtrees of the data tree and achieves a space complexity
that is independent of the data size. The pruning makes use
of the top-k guarantee, which is not given in our scenario.

Garofalakis and Kumar [17] embed the tree edit distance 
with subtree move as an additional edit operation into a numeric
vector space equipped with the standard $L_1$ distance
norm. They compute an efficient approximation of the tree
edit distance with asymptotic approximation guarantees. In 
our work, we compute the exact tree edit distance. 

In this work we assume ordered trees. For unordered trees 
the problem is NP-hard [32]. Augsten et al. [3] study an 
efficient approximation for the unordered tree edit distance. 

8. EXPERIMENTS 

We empirically evaluate our solution (RTED) on both 
real world and synthetic datasets and compare it to the 
fastest algorithms proposed in literature: the algorithm by 
Zhang and Shasha [31] (Zhang-L), which uses only left paths 
to decompose the trees; the symmetric version of this al- 
gorithm that always uses right paths (Zhang-R); Klein's 
algorithm [22] (Klein-H), which uses heavy paths only in 
one tree; the worst-case optimal solution by Demaine et 
al. [15] (Demaine-H), which decomposes both trees with 
heavy paths. Our implementation of Klein's algorithm in- 
cludes the improvements proposed by Demaine et al. [15] 
and runs in quadratic space. All algorithms were imple- 
mented as single-thread applications in Java 1.6 and run on 
a dual-core AMD64 server. 

The Datasets. We test the algorithms on both synthetic 
and real world data. We generate synthetic trees of six 
different shapes. The left branch, right branch, and the 
zigzag tree (Figure 7) are constructed such that the strate- 
gies Zhang-L, Zhang-R, and Demaine-H are optimal, respec- 
tively; for the full binary tree both Zhang-L and Zhang-R 
are optimal; the mixed tree shape does not favor any of the 
algorithms; the random trees vary in depth and fanout (with 
a maximum depth of 15 and a maximum fanout of 6). 

We choose three real world datasets with different characteristics.
SwissProt¹ is an XML protein sequence database
with 50000 medium sized and flat trees (maximum depth
4, maximum fanout 346, average size 187); TreeBank² is an
XML representation of natural language syntax trees with
56385 small and deep trees (average depth 10.4, maximum
depth 35, average size 68). TreeFam³ stores 16138 phylogenetic
trees of animal genes (average depth 14, maximum
depth 158, average fanout 2, average size 95).

¹ http://www.expasy.ch/sprot/
² http://www.cis.upenn.edu/~treebank/

Figure 7: Shapes of the synthetic trees. (a) Left branch tree (LB),
(b) Right branch tree (RB), (c) Full binary tree (FB),
(d) Zig-zag tree (ZZ), (e) Mixed tree (MX).

Number of Relevant Subproblems. We first compare the 
number of relevant subproblems computed by each of the 
algorithms for different tree shapes. The relevant subprob- 
lems are the constant time operations that make up the 
complexity of the algorithm. We create trees with different 
shapes (the five shapes shown in Figure 7 and random trees) 
with sizes varying between 20 and 2000 nodes. We count 
the number of relevant subproblems computed by each of 
the algorithms for pairs of identical trees. The results are 
shown in Figure 8. Each of the tested algorithms, except 
RTED, degenerates for at least one of the tree shapes, i.e., 
its asymptotic runtime behavior is higher than necessary. 
RTED either wins together with the best competitor (left 
branch, right branch, full binary, zigzag), or is the only win- 
ner (random, mixed), which confirms our analytic results. 
Note that the trees, for which RTED is even with another 
algorithm, were created such that the strategy of one of the 
competitors is optimal. The differences are substantial, for 
example, for the left branch trees with 1700 nodes (Fig- 
ure 8(a)) Zhang-R produces 2290 times more relevant sub- 
problems than RTED; for the mixed trees with 1600 nodes, 
the best competitor of RTED (Zhang-L) does 8.5 and the 
worst competitor (Klein-H) 30 times more computations. 

Runtime on Synthetic Data. In Figure 9 we compare the 
runtimes of Zhang-L, Demaine-H, and RTED for different 
tree shapes. The runtime for the full binary tree is shown 
in Figure 9(a). Zhang-L and RTED scale well with the tree 
size, whereas the runtime of Demaine-H grows fast. This is 
expected since Demaine-H must compute many more sub- 
problems than the other two algorithms (cf. Figure 8(c)). 
The strategy of Zhang-L is optimal for the full binary tree 
such that Zhang-L and RTED compute the same number 
of subproblems. The overall runtime of RTED is higher 
since RTED pays a small additional cost for computing the 
strategy. Further, our implementation of Zhang-L is opti- 
mized for the hard-coded strategy, such that the runtime 
per subproblem is smaller for Zhang-L than for RTED (by 
a constant factor below two). For other tree shapes, this 
advantage of Zhang-L is outweighed by the smaller number 
of subproblems that RTED must compute. For the zig-zag 
trees in Figure 9(b), Zhang-L is slower than both RTED and 
Demaine-H. RTED's overhead for the strategy computation 
is negligible compared to the overall runtime, and RTED is
slightly faster than Demaine-H. For the mixed tree shapes
in Figure 9(c), RTED scales very well, while the runtimes of 
both Zhang-L and Demaine-H grow fast with the tree size. 

Scalability of Similarity Join. We generate a set of trees, 
T = {LB, RB, FB, ZZ, Random}, with different shapes and 
approximately 1000 nodes per tree. We perform a self join 
on T that matches two trees $T_1, T_2 \in T$ if $TED(T_1, T_2) < \tau$.
Table 1 shows the runtime (average of three runs) and 
the number of relevant subproblems computed in the join. 
RTED widely outperforms all other algorithms. Different 
from the previous experiment, where we tested on pairs of 
identical trees, the join computes the distance between all 
pairs of trees, regardless of their shape. The competitors 
of RTED degenerate for some pairs of trees with different 
shapes, leading to high runtimes. For example, both Zhang- 
L and Zhang-R run into their worst case for pairs of unbal- 
anced trees LB and RB. Unbalanced trees appear frequently 
in practice, for example, in the phylogenetic dataset. The 
runtime per relevant subproblem differs between the solu- 
tions and is the smallest for Zhang-L and Zhang-R. 



Algorithm    Time [sec]   #Rel. subproblems
Zhang-L      694          26.76 × 10^9
Zhang-R      908          27.13 × 10^9
Klein-H      2483         41.82 × 10^9
Demaine-H    938          17.62 × 10^9
RTED         140          1.96 × 10^9

Table 1: Join on trees with different shapes.

³ http://www.treefam.org/

Overhead of Strategy Computation. RTED computes the 
optimal strategy before the tree edit distance is computed. 
We measure the overhead that the strategy computation 
adds to the overall runtime. We run our tests on three 
datasets: SwissProt, TreeBank, and a synthetic data set
with random trees that vary in size, fanout, and depth. We 
pick tree pairs at regular size intervals and compute the tree 
edit distance. For a given tree size n we pick the two trees 
in the dataset that are closest to n; the size value used in 
the graphs is the average size of the two trees. Figure 10 
shows the runtime for computing the strategy for a pair of 
trees and the overall runtime of RTED. The strategy com- 
putation scales well and the fraction it takes in the overall 
runtime decreases with the tree size. The spikes in the over- 
all runtime are due to the different tree shapes, for which 
more or less efficient strategies exist. The runtime for the 
strategy computation is independent of the tree shape. The 
results confirm our analytic findings in Section 6. 

Scalability on Real World Data. We measure the scala- 
bility of the tree edit distance algorithms for the TreeFam 
dataset with phylogenetic trees. We partition the dataset by 
size (less than 500, 500-1000, more than 1000 nodes) and 
compute the distance between pairs of trees of two parti- 
tions. We measure the performance of RTED as the percent- 
age of relevant subtree computations that RTED performs 
with respect to the best (Table 2(a)) and the worst com- 
petitor (Table 2(b)). The best and worst competitors vary 
between the pairs of partitions. The tables show the results 
for random samples of size 20 from each partition. RTED
always computes fewer subproblems than all its competitors
(84.2% to 94.4% w.r.t. the best competitor, 5.6% to 30.6%
w.r.t. the worst competitor). The advantage increases with
the tree size. For the partitions with the largest trees, RTED
produces 18 times fewer relevant subproblems than the worst
competitor. This experiment shows the relevance of RTED
in practical settings. The wrong choice among the competing
algorithms for a specific dataset may lead to highly
varying runtimes. RTED is robust to different tree shapes
and always performs well.

Figure 8: Number of relevant subproblems for different algorithms and tree shapes.

Figure 9: Runtime of the fastest tree edit distance algorithms for different tree shapes.

Table 2: Ratio of relevant subproblems computed by
RTED w.r.t. the (a) best and (b) worst competitor.
(b) RTED to the worst competitor; surviving cells in extraction
order: 17.3%, 21.1%, 8.9%, 7.9%; row >1000: 18.3%, 7.7%, 5.6%.

9. CONCLUSION 

In this paper we discussed the tree edit distance between
ordered labeled trees. We introduced the class of LRH
strategies, which generalizes previous approaches and includes
the best algorithms for the tree edit distance. Our
general tree edit distance algorithm, GTED, runs any LRH
strategy in $O(n^2)$ space. We developed an efficient algorithm
for computing the optimal LRH strategy for GTED.
The resulting algorithm, RTED, runs in $O(n^2)$ space as its
best competitors and the $O(n^3)$ runtime of RTED is worst-case
optimal. Compared to previous algorithms, RTED is
efficient for any tree shape. In particular we showed that
the number of subproblems computed by RTED is at most
the number of subproblems that its best competitor must
compute. Our empirical evaluation confirmed that RTED is
efficient for any input and outperforms the other approaches,
especially when the tree shapes within the dataset vary.